Analysis and calculations in service of planning the 2021 follow-up experiment.
Load the DNA sample data and compute the concentration of input material for each aliquot in terms of pellets per aliquot, accounting for different homogenization volumes. Here I approximate the homogenization volume by the amount of PBS added; this is likely an ok approximation for getting relative concentration differences within sample types, which is the main purpose here.
dna <- here("output", "sample-data", "dna-sample-data.Rds") %>%
readRDS %>%
mutate(
pellets_per_aliquot = number_of_pellets / volume_pbs * volume_per_aliquot
) %>%
glimpse
#> Rows: 100
#> Columns: 27
#> $ dna_sample_id <chr> "1_1", "1_2", "1_3", "1_4", "2_1", "2_2"…
#> $ specimen_id <fct> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4…
#> $ aliquot_number <fct> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2…
#> $ specimen_type <fct> T. mobilis, T. mobilis, T. mobilis, T. m…
#> $ collection_week <fct> NA, NA, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ collection_day <fct> NA, NA, NA, NA, 1, 1, 1, 1, 2, 2, 2, 2, …
#> $ collection_date <date> NA, NA, NA, NA, 2019-09-09, 2019-09-09,…
#> $ host_species <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ host_sex <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ extraction_batch <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ extraction_date <fct> 2019-12-03, 2019-12-03, 2019-12-03, 2019…
#> $ number_of_pellets <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ number_of_aliquots <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
#> $ volume_pbs <dbl> 1000, 1000, 1000, 1000, 1000, 1000, 1000…
#> $ volume_per_aliquot <dbl> 200, 200, 200, 200, 200, 200, 200, 200, …
#> $ extraction_protocol <fct> 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2…
#> $ dna_conc <dbl> 0.279, 0.958, 0.357, 1.185, 0.283, 1.159…
#> $ dna_conc_s1 <dbl> 0.080, 0.720, NA, NA, 0.080, 0.761, 0.00…
#> $ dna_conc_s2 <dbl> 0.358, 1.550, NA, NA, 0.272, 3.200, 0.26…
#> $ well <chr> "C4", "F9", "H11", "H12", "G7", "E5", "A…
#> $ row <ord> C, F, H, H, G, E, A, G, D, E, H, E, C, F…
#> $ column <ord> 4, 9, 11, 12, 7, 5, 3, 3, 10, 12, 3, 10,…
#> $ A1 <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TR…
#> $ A2 <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
#> $ S1 <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TR…
#> $ S2 <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TR…
#> $ pellets_per_aliquot <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, …
dna %>%
count(specimen_type, number_of_aliquots, number_of_pellets,
pellets_per_aliquot
) %>%
kable()
| specimen_type | number_of_aliquots | number_of_pellets | pellets_per_aliquot | n |
|---|---|---|---|---|
| Fecal | 4 | 1 | 0.200 | 64 |
| Inoculum | 2 | 4 | 1.778 | 2 |
| Inoculum | 4 | 1 | 0.200 | 12 |
| Inoculum | 4 | 4 | 0.941 | 4 |
| Inoculum | 4 | 5 | 1.176 | 4 |
| Inoculum | 4 | 6 | 1.412 | 8 |
| T. mobilis | 2 | 10 | 4.000 | 2 |
| T. mobilis | 4 | 1 | 0.200 | 4 |
In the second extraction batch, Angie pooled pellets of the inoculum and T. mobilis, and also added less PBS, in order to increase the biomass.
We have three different DNA concentration measurements for most samples: Our own Qubit measurements, the Picogreen measurements from center S1, and the Qubit measurements from center S2. Let’s take a quick look at how they compare.
dna %>%
mutate(across(starts_with("dna_conc"), ~log10(. + 1e-2))) %>%
ggplot(aes(x = .panel_x, y = .panel_y,
color = specimen_type, fill = specimen_type)) +
geom_point(alpha = 0.5) +
ggforce::geom_autodensity(alpha = 0.5, na.rm = TRUE) +
ggforce::facet_matrix(vars(starts_with("dna_conc")), layer.diag = 2) +
theme_minimal_grid() +
scale_color_brewer(type = "qual") +
scale_fill_brewer(type = "qual")
What does this imply about where certain measurements might be over/under-estimating changes in yield?
Given the general agreement between the three measurement types, let’s do some exploration using our measurements (which we have for a larger set of samples). Let’s consider yield vs expected biomass as determined by the number of pellets per aliquot.
p0 <- dna %>%
ggplot(aes(pellets_per_aliquot, dna_conc,
color = specimen_type, shape = extraction_batch)) +
scale_x_log10() +
scale_y_log10() +
scale_color_brewer(type = "qual", palette = 2) +
scale_shape_manual(values = c(1, 3)) +
theme_minimal_grid() +
coord_fixed() +
facet_wrap(~extraction_protocol, labeller = "label_both")
p1 <- p0 +
geom_quasirandom(groupOnX = TRUE, width = 0.2)
p2 <- p0 +
geom_text(aes(label = specimen_id),
position = position_quasirandom(groupOnX = TRUE, width = 0.2))
p1 / p2
The bottom panel is the same as the top but with points labeled by specimen id.
Some discussion taken from Section “Differences in DNA yields” in the planning doc:
Some other observations:
For Protocol 2, the relationship between pellets per aliquot (PPA) and yield looks roughly proportional across the two batches. For Protocol 1, it looks somewhat less than proportional for the inoculum samples, but greater than proportional for the high PPA T.mob sample. Possibly of note is that the second batch went through a further freeze-thaw cycle (all sample types) and suspension and pelleting (inoculum and T. mobilis types) (CHECK); this might be expected to increase yields though this is unclear.
Might be worth looking at the other DNA concentration measurements, which might give some evidence as to whether our concentration measurements themselves are linear at the low end.
y <- dna %>%
pivot_longer(starts_with("dna_conc")) %>%
filter(!is.na(value)) %>%
mutate(
value = value / pellets_per_aliquot
) %>%
with_groups(
c(specimen_type, extraction_protocol, extraction_batch, name),
# geometric median
summarize, across(value, ~exp(median(log(.))))
) %>%
pivot_wider(names_from = specimen_type) %>%
mutate(across(c(Inoculum, `T. mobilis`), ~ Fecal / .)) %>%
pivot_longer(c(Inoculum, `T. mobilis`), names_to = "specimen_type")
y %>%
ggplot(aes(extraction_batch, value, color = specimen_type)) +
geom_quasirandom() +
scale_y_log10(breaks = 10^seq(-2, 4)) +
facet_grid(name~extraction_protocol, labeller = "label_both") +
expand_limits(y = 1) +
theme_minimal_hgrid() +
scale_color_brewer(type = "qual", palette = 1) +
theme(
legend.position = "bottom",
panel.spacing.y = unit(10, "mm")
)
Can see that extraction protocol 2 consistent results indicate a multiplier of around 10X is needed, while extraction protocol 1 indicates a factor of 100X or more is needed, depending on which concentration measurements we use. Splitting the difference and aiming for 30X increase in input biomass per aliquot is a reasonable compromise, but shooting for 100X is a safer bet if we want to ensure that the dilution window overlaps the fecal composition in the original experiment.
Goal is to choose mock compositions (in terms of volume of pure culture) for the 2021 experiment using the abundance profiles from the 2019 experiment.
The aim is to have three mock communities with the following properties:
I will use the measurements from all four sequencing centers to help determine mixing volumes for mocks M2 and M3.
ps <- here("output/phyloseq/2020-11-28-phyloseq.Rds") %>%
readRDS %>%
mutate_sample_data(., sample_sum = sample_sums(.))
tree <- here("output/strain-data/gtdb-representatives.tree") %>%
ape::read.tree()
Let’s filter to the inoculum strains and E. coli Nissle, and aggregate to species level.
Note that for the amplicon ASVs that match E. coli HS and E. coli Nissle 1917, we have multiple entries in organism and strain_group. This is important to keep in mind when working with the amplicon data and doing taxonomic filtering or merging.
tax <- tax_table(ps) %>% as_tibble
tax.ec <- tax %>% filter(str_detect(organism, "Esch"))
tax.ec %>% select(.otu, species:source)
#> # A tibble: 6 x 5
#> .otu species organism strain_group source
#> <chr> <chr> <chr> <chr> <chr>
#> 1 v2_A1_A… Escherich… Escherichia coli HS; E… Inoculum; Mouse … A1
#> 2 v2_A1_A… Escherich… Escherichia coli Nissl… Mouse contaminant A1
#> 3 v2_A2_A… Escherich… Escherichia coli HS; E… Inoculum; Mouse … A2
#> 4 v2_A2_A… Escherich… Escherichia coli Nissl… Mouse contaminant A2
#> 5 GCF_000… Escherich… Escherichia coli HS Inoculum RefSeq
#> 6 GCF_003… Escherich… Escherichia coli Nissl… Mouse contaminant RefSeq
tax0 <- tax %>%
filter(
strain_group == "Inoculum" | species == "Escherichia coli",
str_detect(organism, "Esch")
)
all.equal(tax.ec, tax0)
#> [1] TRUE
rm(tax.ec, tax0)
I also filter to the primary sample types and removal of the low-count samples from A2,
ps1 <- ps %>%
filter_sample_data(
specimen_type %in% c("Fecal", "Inoculum"),
sample_sum >= 1e4
) %>%
filter_tax_table(
strain_group == "Inoculum" | species == "Escherichia coli",
) %>%
tax_glom("species") %>%
mutate_tax_table(.otu = species)
ps1 %>% tax_table %>% as_tibble %>% select(phylum, species) %>%
arrange(phylum) %>% kable
| phylum | species |
|---|---|
| Actinobacteriota | Collinsella aerofaciens |
| Bacteroidota | Bacteroides ovatus |
| Bacteroidota | Bacteroides uniformis |
| Bacteroidota | Bacteroides thetaiotaomicron |
| Bacteroidota | Bacteroides caccae |
| Bacteroidota | Barnesiella intestinihominis |
| Firmicutes | Clostridium symbiosum |
| Firmicutes | Roseburia intestinalis |
| Firmicutes | Faecalibacterium prausnitzii |
| Firmicutes | Marvinbryantia formatexigens |
| Firmicutes | Eubacterium rectale |
| Proteobacteria | Escherichia coli |
| Verrucomicrobiota | Akkermansia muciniphila |
I will work with the abundances as compositional vectors, with a pseudo-proportion of 1e-3 added to account for detection limits and to moderate the extreme log-fold differences that can arise due to some organisms not being present or detectable in all samples.
ps1.1 <- ps1 %>%
transform_sample_counts(close_elts) %>%
transform_sample_counts(~center_elts(. + 3e-3))
Let’s take a look at all the data,
p <- ps1.1 %>% as_tibble %>%
ggplot(aes(y = abbreviate(.otu), x = .abundance, color = specimen_type,
shape = extraction_protocol:center_id)
) +
facet_wrap(~extraction_batch,
labeller = label_both) +
scale_x_log10() +
scale_color_brewer(type = "qual") +
scale_shape_manual(values = c(1, 16, 2, 17, 5, 18, 3, 8)) +
theme_minimal_hgrid() +
theme(axis.title.y = element_blank())
p +
geom_quasirandom(groupOnX = FALSE)
Note, I think the shotgun bioinformatics protocol is expected to be biased against E. coli, and we see that here for the inoculum samples but not in the fecal samples.
ps1.1 %>%
filter_tax_table(species == "Escherichia coli") %>%
as_tibble %>%
ggplot(aes(y = .otu, x = .abundance, color = specimen_type,
shape = extraction_protocol)
) +
facet_grid(center_id~extraction_batch, labeller = label_both) +
scale_x_log10() +
scale_color_brewer(type = "qual") +
scale_shape_manual(values = c(1, 16)) +
theme_minimal_hgrid() +
theme(
axis.title.y = element_blank(),
axis.text.y = element_blank()
) +
geom_quasirandom(groupOnX = FALSE) +
plot_annotation(
title = "E. coli abundance relative to geometric mean"
)
Let’s try to compute the compositional difference between fecal and inoculum for each batch, protocol, center combination.
#> Order by the phylogeny,
lvls <- tax %>%
filter(.otu %in% tree$tip.label) %>%
mutate(across(.otu, factor, levels = tree$tip.label)) %>%
arrange(.otu) %>%
pull(species) %>%
unique
x <- ps1.1 %>%
as_tibble %>%
group_by(.otu, specimen_type, extraction_protocol, center_id,
extraction_batch, specimen_id
) %>%
summarise(across(.abundance, gm_mean), .groups = "drop_last") %>%
summarise(across(.abundance, gm_mean), .groups = "drop") %>%
mutate(across(.otu, factor, levels = lvls)) %>%
arrange(.otu)
y <- x %>%
pivot_wider(names_from = specimen_type, values_from = .abundance) %>%
mutate(diff = Fecal / Inoculum)
y %>%
mutate(across(.otu, fct_reorder, diff)) %>%
ggplot(aes(y = abbreviate(.otu), x = diff,
color = center_id:extraction_protocol)
) +
facet_grid(~extraction_batch, labeller = label_both) +
scale_x_log10() +
scale_color_brewer(type = "qual", palette = 2) +
theme_minimal_hgrid() +
geom_quasirandom(groupOnX = F, width = 0.2) +
theme(axis.title.y = element_blank())
# geom_vline(xintercept = c(0.1, 1, 10))
This graph indicates that the compositional difference between fecal and inoculum samples is fairly consistent across extraction and sequencing protocol, which is a nice confirmation of our expectation from the MWC model. There are, however, some interesting modest exceptions. For example, Ec and Clss are reduced in S{1,2}:E1 protocols.
Something notable is that Clss is reduced in the second experiment relative to the first. Angie reported a much lower OD for Clss in the second inoculum batch, so this makes some sense.
Let’s use the geometric average to choose the mixing ratios,
m2 <- y %>%
with_groups(.otu, summarize,
across(c(Fecal, Inoculum), gm_mean)) %>%
mutate(
diff = Fecal / Inoculum,
diff_sqrt = sqrt(diff)
) %>%
arrange(diff)
m2 %>% kable
| .otu | Fecal | Inoculum | diff | diff_sqrt |
|---|---|---|---|---|
| Faecalibacterium prausnitzii | 0.115 | 2.430 | 0.047 | 0.218 |
| Collinsella aerofaciens | 0.167 | 1.644 | 0.102 | 0.319 |
| Roseburia intestinalis | 0.598 | 3.582 | 0.167 | 0.409 |
| Escherichia coli | 2.164 | 7.980 | 0.271 | 0.521 |
| Marvinbryantia formatexigens | 0.401 | 0.790 | 0.508 | 0.713 |
| Eubacterium rectale | 0.149 | 0.169 | 0.881 | 0.938 |
| Clostridium symbiosum | 1.295 | 0.774 | 1.674 | 1.294 |
| Akkermansia muciniphila | 2.444 | 1.173 | 2.084 | 1.444 |
| Bacteroides caccae | 1.833 | 0.821 | 2.233 | 1.494 |
| Barnesiella intestinihominis | 0.288 | 0.094 | 3.075 | 1.754 |
| Bacteroides uniformis | 5.147 | 1.600 | 3.217 | 1.793 |
| Bacteroides thetaiotaomicron | 5.063 | 1.229 | 4.121 | 2.030 |
| Bacteroides ovatus | 15.427 | 0.479 | 32.214 | 5.676 |
Simple option: make the ratios 1 for the first 6, for the next 6, and 4 for the final (Bcto).
m2 <- m2 %>%
mutate(
rank = rank(diff),
ratio = case_when(
rank <= 6 ~ 1,
between(rank, 7, 12) ~ 2,
rank == 13 ~ 4,
)
)
For the third mock, aim to make the percentage of 16S reads (more) even. Similar to Mock 2, I will use just three rounded ratio values for simplicity of pipetting.
m3 <- y %>%
filter(center_id %in% c("A1", "A2")) %>%
with_groups(.otu, summarize,
across(c(Fecal, Inoculum), gm_mean)) %>%
mutate(
target = 1 / Inoculum,
target_sqrt = sqrt(target)
) %>%
arrange(target)
m3 %>% kable
| .otu | Fecal | Inoculum | target | target_sqrt |
|---|---|---|---|---|
| Escherichia coli | 3.590 | 7.265 | 0.138 | 0.371 |
| Roseburia intestinalis | 0.724 | 3.398 | 0.294 | 0.542 |
| Faecalibacterium prausnitzii | 0.101 | 2.884 | 0.347 | 0.589 |
| Akkermansia muciniphila | 4.414 | 1.747 | 0.572 | 0.756 |
| Bacteroides uniformis | 4.038 | 1.600 | 0.625 | 0.791 |
| Collinsella aerofaciens | 0.180 | 1.519 | 0.658 | 0.811 |
| Bacteroides thetaiotaomicron | 3.479 | 1.228 | 0.814 | 0.902 |
| Bacteroides caccae | 1.538 | 0.975 | 1.025 | 1.013 |
| Clostridium symbiosum | 1.470 | 0.643 | 1.555 | 1.247 |
| Marvinbryantia formatexigens | 0.261 | 0.560 | 1.786 | 1.336 |
| Bacteroides ovatus | 10.894 | 0.539 | 1.857 | 1.363 |
| Eubacterium rectale | 0.153 | 0.162 | 6.165 | 2.483 |
| Barnesiella intestinihominis | 0.348 | 0.088 | 11.394 | 3.376 |
m3 <- m3 %>%
mutate(
rank = rank(target),
ratio = case_when(
rank <= 1 ~ 1,
between(rank, 2, 11) ~ 1.5,
rank >= 12 ~ 3,
)
)
m3 %>% kable
| .otu | Fecal | Inoculum | target | target_sqrt | rank | ratio |
|---|---|---|---|---|---|---|
| Escherichia coli | 3.590 | 7.265 | 0.138 | 0.371 | 1 | 1.0 |
| Roseburia intestinalis | 0.724 | 3.398 | 0.294 | 0.542 | 2 | 1.5 |
| Faecalibacterium prausnitzii | 0.101 | 2.884 | 0.347 | 0.589 | 3 | 1.5 |
| Akkermansia muciniphila | 4.414 | 1.747 | 0.572 | 0.756 | 4 | 1.5 |
| Bacteroides uniformis | 4.038 | 1.600 | 0.625 | 0.791 | 5 | 1.5 |
| Collinsella aerofaciens | 0.180 | 1.519 | 0.658 | 0.811 | 6 | 1.5 |
| Bacteroides thetaiotaomicron | 3.479 | 1.228 | 0.814 | 0.902 | 7 | 1.5 |
| Bacteroides caccae | 1.538 | 0.975 | 1.025 | 1.013 | 8 | 1.5 |
| Clostridium symbiosum | 1.470 | 0.643 | 1.555 | 1.247 | 9 | 1.5 |
| Marvinbryantia formatexigens | 0.261 | 0.560 | 1.786 | 1.336 | 10 | 1.5 |
| Bacteroides ovatus | 10.894 | 0.539 | 1.857 | 1.363 | 11 | 1.5 |
| Eubacterium rectale | 0.153 | 0.162 | 6.165 | 2.483 | 12 | 3.0 |
| Barnesiella intestinihominis | 0.348 | 0.088 | 11.394 | 3.376 | 13 | 3.0 |
Create a table with the ratios for all three mocks, and check the total volumes
m <- bind_rows(
M1 = m2 %>% transmute(.otu, ratio = 1),
M2 = m2 %>% select(.otu, ratio),
M3 = m3 %>% select(.otu, ratio),
.id = "mock"
) %>%
arrange(.otu)
m %>% pivot_wider(names_from = mock, values_from = ratio) %>% kable
| .otu | M1 | M2 | M3 |
|---|---|---|---|
| Escherichia coli | 1 | 1 | 1.0 |
| Bacteroides uniformis | 1 | 2 | 1.5 |
| Bacteroides ovatus | 1 | 4 | 1.5 |
| Bacteroides caccae | 1 | 2 | 1.5 |
| Bacteroides thetaiotaomicron | 1 | 2 | 1.5 |
| Barnesiella intestinihominis | 1 | 2 | 3.0 |
| Akkermansia muciniphila | 1 | 2 | 1.5 |
| Faecalibacterium prausnitzii | 1 | 1 | 1.5 |
| Marvinbryantia formatexigens | 1 | 1 | 1.5 |
| Eubacterium rectale | 1 | 1 | 3.0 |
| Roseburia intestinalis | 1 | 1 | 1.5 |
| Clostridium symbiosum | 1 | 2 | 1.5 |
| Collinsella aerofaciens | 1 | 1 | 1.5 |
m %>% with_groups(mock, summarize, across(ratio, sum)) %>% kable
| mock | ratio |
|---|---|
| M1 | 13 |
| M2 | 22 |
| M3 | 22 |
Let’s adjust the final volumes so that the mocks all sum to ~66 mL while still having round numbers,
m_vol <- m %>%
mutate(
mult = case_when(
mock == "M1" ~ 5,
mock != "M1" ~ 3
),
# across(mult, ~ . * 2), # double for two bottle option
volume = ratio * mult,
)
m_vol %>% with_groups(mock, summarize, across(volume, sum)) %>% kable
| mock | volume |
|---|---|
| M1 | 65 |
| M2 | 66 |
| M3 | 66 |
m_vol %>% with_groups(.otu, summarize, across(volume, sum)) %>% kable
| .otu | volume |
|---|---|
| Escherichia coli | 11.0 |
| Bacteroides uniformis | 15.5 |
| Bacteroides ovatus | 21.5 |
| Bacteroides caccae | 15.5 |
| Bacteroides thetaiotaomicron | 15.5 |
| Barnesiella intestinihominis | 20.0 |
| Akkermansia muciniphila | 15.5 |
| Faecalibacterium prausnitzii | 12.5 |
| Marvinbryantia formatexigens | 12.5 |
| Eubacterium rectale | 17.0 |
| Roseburia intestinalis | 12.5 |
| Clostridium symbiosum | 15.5 |
| Collinsella aerofaciens | 12.5 |
m_vol_wide <- m_vol %>%
select(.otu, mock, volume) %>%
pivot_wider(names_from = mock, values_from = volume)
m_vol_wide %>% kable
| .otu | M1 | M2 | M3 |
|---|---|---|---|
| Escherichia coli | 5 | 3 | 3.0 |
| Bacteroides uniformis | 5 | 6 | 4.5 |
| Bacteroides ovatus | 5 | 12 | 4.5 |
| Bacteroides caccae | 5 | 6 | 4.5 |
| Bacteroides thetaiotaomicron | 5 | 6 | 4.5 |
| Barnesiella intestinihominis | 5 | 6 | 9.0 |
| Akkermansia muciniphila | 5 | 6 | 4.5 |
| Faecalibacterium prausnitzii | 5 | 3 | 4.5 |
| Marvinbryantia formatexigens | 5 | 3 | 4.5 |
| Eubacterium rectale | 5 | 3 | 9.0 |
| Roseburia intestinalis | 5 | 3 | 4.5 |
| Clostridium symbiosum | 5 | 6 | 4.5 |
| Collinsella aerofaciens | 5 | 3 | 4.5 |
ss <- "https://docs.google.com/spreadsheets/d/1mqrCoBj7y2lajEY3VyiPOy98S9HAY-YRKyQLEKI4LOs"
m_vol_wide %>%
googlesheets4::sheet_write(ss, sheet = "[R export] Mixing volumes")
In the fecal samples, aliquots 1 and 3 were extrated by protocol 1, and aliquots 2 and 4 by protocol 2. Our analysis assumes that all four aliquots are equivelent prior to DNA extraction. Is there any evidence to the contrary, in which case we might want to randomize the protocol treatment across aliquots in the new experiment?
Did our homogenization and aliquot-assignment procedure led to any systematic differences in the DNA yield between the first and second aliquot for a given specimen-protocol combination?
z <- dna %>%
filter(specimen_type == "Fecal") %>%
with_groups(c(specimen_id, extraction_protocol), mutate,
aliquot_rank = rank(aliquot_number) %>% factor)
z %>%
ggplot(aes(y = aliquot_rank, x = dna_conc)) +
facet_grid(extraction_batch ~ extraction_protocol) +
geom_quasirandom(groupOnX = FALSE) +
scale_x_log10()
Let’s plot the concentration from the second aliquot against the first,
z %>%
select(specimen_id, starts_with("extraction"), dna_conc, aliquot_rank) %>%
mutate(across(aliquot_rank, fct_recode, first = "1", second = "2")) %>%
pivot_wider(names_from = aliquot_rank, values_from = dna_conc) %>%
ggplot(aes(y = second, x = first,
color = extraction_protocol:extraction_batch)) +
scale_x_log10() + scale_y_log10() + coord_fixed() +
scale_color_brewer(type = "qual", palette = "Paired") +
theme(
legend.position = "bottom"
) +
geom_abline(color = "grey") +
geom_point()
There appears to be a 1:1 relationship between yields from the first to second aliquot of a given specimen-protocol set. This result suggests that the homogenization procedure successfully homogenized aliquots 1 and 3, and aliquots 2 and 4.
I looked at this question in the pilot data (A1 data, v1 pipeline) in analysis/2020-02-29-homogenization.html, and will take another look here using all the data.
One simple thing we can try is to compare the CLR values between the first and second aliquots of a given specimen-protocol combination, as with DNA yield above.
x <- ps1.1 %>%
as_tibble %>%
mutate(
aliquot_rank = case_when(
aliquot_number %in% 1:2 ~ "First",
aliquot_number %in% 3:4 ~ "Second",
TRUE ~ NA_character_
)
) %>%
filter(!is.na(aliquot_rank)) %>%
select(.otu, .abundance, center_id, specimen_id, starts_with("extraction"),
aliquot_rank) %>%
pivot_wider(names_from = aliquot_rank, values_from = .abundance)
#> # A tibble: 3 x 3
#> `is.na(First)` `is.na(Second)` n
#> <lgl> <lgl> <int>
#> 1 FALSE FALSE 148
#> 2 FALSE TRUE 33
#> 3 TRUE FALSE 6
A number of cases do not have both First and Second values; perhaps this is due to the samples that were filtered from low read counts or left out of the centers that received fewer than 96 samples?
#> # A tibble: 9 x 4
#> center_id `is.na(First)` `is.na(Second)` n
#> <fct> <lgl> <lgl> <int>
#> 1 A1 FALSE FALSE 42
#> 2 A1 FALSE TRUE 6
#> 3 A2 FALSE FALSE 28
#> 4 A2 FALSE TRUE 10
#> 5 A2 TRUE FALSE 6
#> 6 S1 FALSE FALSE 36
#> 7 S1 FALSE TRUE 11
#> 8 S2 FALSE FALSE 42
#> 9 S2 FALSE TRUE 6
x %>%
ggplot(aes(First, Second, color = extraction_protocol:extraction_batch)) +
facet_grid(abbreviate(.otu) ~ center_id) +
scale_x_log10() + scale_y_log10() + coord_fixed() +
scale_color_brewer(type = "qual", palette = "Paired") +
theme(
legend.position = "bottom"
) +
geom_abline(color = "grey") +
geom_point()
We see a strong 1:1 relationship across taxa, suggesting that our procedure led to replicate aliquots that behaved approximately the same in the experiment.
First, get a table with the basic info for each specimen
mocks <- tibble(mixture = str_c("M", 1:3)) %>%
crossing(feces_added = c(TRUE, FALSE)) %>%
mutate(
specimen_base_id = str_c(mixture, ifelse(feces_added, "F", "N")),
specimen_type = ifelse(feces_added, "Mock in feces", "Mock only")
)
gnot <- tibble(
specimen_base_id = str_c("F", 1:3),
specimen_type = "Fecal"
)
tmob <- tibble(
specimen_base_id = "T1",
specimen_type = "T. mobilis",
dilution_power = 0L,
feces_added = FALSE
)
all_specimens <- bind_rows(mocks, gnot) %>%
crossing(
dilution_power = 0:2
) %>%
bind_rows(tmob) %>%
mutate(
dilution_label = str_c("D", dilution_power),
dilution_factor = 10^dilution_power,
specimen_name = str_c(sep = "-",
specimen_base_id,
dilution_label
)
) %>%
relocate(specimen_name) %>%
glimpse
#> Rows: 28
#> Columns: 8
#> $ specimen_name <chr> "M1N-D0", "M1N-D1", "M1N-D2", "M1F-D0", "M1…
#> $ mixture <chr> "M1", "M1", "M1", "M1", "M1", "M1", "M2", "…
#> $ feces_added <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALS…
#> $ specimen_base_id <chr> "M1N", "M1N", "M1N", "M1F", "M1F", "M1F", "…
#> $ specimen_type <chr> "Mock only", "Mock only", "Mock only", "Moc…
#> $ dilution_power <int> 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2…
#> $ dilution_label <chr> "D0", "D1", "D2", "D0", "D1", "D2", "D0", "…
#> $ dilution_factor <dbl> 1, 10, 100, 1, 10, 100, 1, 10, 100, 1, 10, …
The DNA samples correspond to the four aliquots obtained from each specimen (with the last two aliquots not used for specimens derived from M3 and F3),
dna_samples <- all_specimens %>%
crossing(aliquot_number = 1:4) %>%
# Remove aliquots 3 and 4 from specimens not getting replicates (M3, F3)
filter(
! (specimen_base_id %in% c("M3N", "M3F", "F3") & aliquot_number %in% 3:4)
) %>%
mutate(
dna_sample_id = str_c(sep = "-", specimen_name, aliquot_number),
extraction_protocol = ifelse(aliquot_number %in% c(1, 3), 1, 2) %>% factor
) %>%
relocate(dna_sample_id)
Well assignment is simplified this time since we are using all 94 samples for all sequencing centers. Wells H11 and H12 are reserved for A1 controls and will be left blank in all plates. I will manually assign the T. mobilis samples, then randomly assign others. This time, I will put T. mobilis controls in distinct rows and columns, roughly evenly spaced across the plate. Rows: A, C, E, G; Cols: 1, 4, 7, 10. In A1, the Zymo control will be in H11 or H12; I’ll put the T. mobilis samples going from lower left (A1) to upper right (G10) to space away from the Zymo sample.
set.seed(4)
reserved_wells <- str_c("H", 11:12)
tmob_wells <- dna_samples %>%
filter(specimen_type == "T. mobilis") %>%
arrange(aliquot_number) %>%
mutate(
row = LETTERS[c(1, 3, 5, 7)],
column = seq(1, 12, by = 3) %>% as.integer %>% rev,
well = str_c(row, column)
)
other_samples <- dna_samples %>%
filter(specimen_type != "T. mobilis") %>%
pull(dna_sample_id)
other_wells <- crossing(row = LETTERS[1:8], column = 1:12) %>%
mutate(well = str_c(row, column)) %>%
filter(!(well %in% c(tmob_wells$well, reserved_wells))) %>%
mutate(dna_sample_id = sample(other_samples)) %>%
left_join(dna_samples, by = "dna_sample_id")
plate_layout <- bind_rows(other_wells, tmob_wells) %>%
arrange(dna_sample_id) %>%
mutate(
across(row, ordered, levels = LETTERS[1:8]),
across(column, ordered, levels = seq(12))
)
plate_layout %>%
mutate(across(row, fct_rev)) %>%
ggplot(aes(y = row, x = column, fill = specimen_type)) +
coord_fixed() +
geom_tile() +
geom_text(aes(label = specimen_base_id), size = 3) +
scale_fill_brewer(type = "qual", palette = 2) +
labs(title = "Plate layout") +
theme(
legend.position = "bottom"
)
Looks good!
Let’s save the DNA sample data with well assignments in a Google Sheet. I’ll also save the well assignments in the form of a 12 by 8 grid with the full DNA sample names, to refer to during the actual plating.
ss <- "https://docs.google.com/spreadsheets/d/10DNhnhE17lPOSIvzbhYxDTG--9KDMkM9Eum9Psb9B10"
plate_layout %>%
googlesheets4::sheet_write(ss, sheet = "[R export] DNA sample names and wells")
plate_layout %>%
select(dna_sample_id, row, column) %>%
pivot_wider(names_from = column, values_from = dna_sample_id,
names_sort = TRUE) %>%
arrange(row) %>%
rename(`row//column` = row) %>%
googlesheets4::sheet_write(ss, sheet = "[R export] Plate layout")
sessioninfo::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.1.0 (2021-05-18)
#> os Arch Linux
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2021-06-29
#>
#> ─ Packages ──────────────────────────────────────────────────────────────────────
#> package * version date lib source
#> ade4 1.7-16 2020-10-28 [1] CRAN (R 4.0.3)
#> ape 5.5 2021-04-25 [1] CRAN (R 4.1.0)
#> askpass 1.1 2019-01-13 [1] CRAN (R 4.0.0)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
#> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.3)
#> beeswarm 0.4.0 2021-06-01 [1] CRAN (R 4.1.0)
#> Biobase 2.52.0 2021-05-19 [1] Bioconductor
#> BiocGenerics 0.38.0 2021-05-19 [1] Bioconductor
#> biomformat 1.20.0 2021-05-19 [1] Bioconductor
#> Biostrings 2.60.0 2021-05-19 [1] Bioconductor
#> bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.0)
#> broom 0.7.6 2021-04-05 [1] CRAN (R 4.0.5)
#> bslib 0.2.5.1 2021-05-18 [1] CRAN (R 4.1.0)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.0)
#> cli 2.5.0 2021-04-26 [1] CRAN (R 4.1.0)
#> cluster 2.1.2 2021-04-17 [2] CRAN (R 4.1.0)
#> codetools 0.2-18 2020-11-04 [2] CRAN (R 4.1.0)
#> colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0)
#> cowplot * 1.1.1 2020-12-30 [1] CRAN (R 4.0.4)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.4)
#> curl 4.3.1 2021-04-30 [1] CRAN (R 4.1.0)
#> data.table 1.14.0 2021-02-21 [1] CRAN (R 4.0.4)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.4)
#> dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.0.5)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> distill 1.2 2021-01-13 [1] CRAN (R 4.1.0)
#> downlit 0.2.1 2020-11-04 [1] CRAN (R 4.0.3)
#> dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
#> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0)
#> farver 2.1.0 2021-02-28 [1] CRAN (R 4.0.4)
#> forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.0.4)
#> foreach 1.5.1 2020-10-15 [1] CRAN (R 4.0.3)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> gargle 1.1.0 2021-04-02 [1] CRAN (R 4.0.5)
#> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3)
#> GenomeInfoDb 1.28.0 2021-05-19 [1] Bioconductor
#> GenomeInfoDbData 1.2.6 2021-05-31 [1] Bioconductor
#> ggbeeswarm * 0.6.0 2017-08-07 [1] CRAN (R 4.0.0)
#> ggforce 0.3.3 2021-03-05 [1] CRAN (R 4.0.4)
#> ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> googledrive 1.0.1 2020-05-05 [1] CRAN (R 4.0.0)
#> googlesheets4 0.3.0 2021-03-04 [1] CRAN (R 4.1.0)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.0)
#> haven 2.4.1 2021-04-23 [1] CRAN (R 4.1.0)
#> here * 1.0.1 2020-12-13 [1] CRAN (R 4.0.5)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
#> hms 1.1.0 2021-05-17 [1] CRAN (R 4.1.0)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#> httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
#> igraph 1.2.6 2020-10-06 [1] CRAN (R 4.0.3)
#> import 1.2.0 2020-09-24 [1] CRAN (R 4.0.2)
#> IRanges 2.26.0 2021-05-19 [1] Bioconductor
#> iterators 1.0.13 2020-10-15 [1] CRAN (R 4.0.3)
#> jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.1.0)
#> jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.3)
#> kableExtra * 1.3.4 2021-02-20 [1] CRAN (R 4.0.4)
#> knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0)
#> labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.3)
#> lattice 0.20-44 2021-05-02 [2] CRAN (R 4.1.0)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4)
#> lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.0.4)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> MASS 7.3-54 2021-05-03 [2] CRAN (R 4.1.0)
#> Matrix 1.3-3 2021-05-04 [2] CRAN (R 4.1.0)
#> metacal * 0.2.0.9001 2021-06-15 [1] Github (mikemc/metacal@a7a87a1)
#> mgcv 1.8-35 2021-04-18 [2] CRAN (R 4.1.0)
#> modelr 0.1.8 2020-05-19 [1] CRAN (R 4.0.0)
#> multtest 2.48.0 2021-05-19 [1] Bioconductor
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.0)
#> nlme 3.1-152 2021-02-04 [2] CRAN (R 4.1.0)
#> nvimcom * 0.9-102 2021-06-15 [1] local
#> openssl 1.4.4 2021-04-30 [1] CRAN (R 4.1.0)
#> patchwork * 1.1.1 2020-12-17 [1] CRAN (R 4.0.3)
#> permute 0.9-5 2019-03-12 [1] CRAN (R 4.0.0)
#> phyloseq * 1.36.0 2021-05-19 [1] Bioconductor
#> pillar 1.6.1 2021-05-16 [1] CRAN (R 4.1.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
#> plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.0)
#> polyclip 1.10-0 2019-03-14 [1] CRAN (R 4.0.0)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0)
#> progress 1.2.2 2019-05-16 [1] CRAN (R 4.0.2)
#> ps 1.6.0 2021-02-28 [1] CRAN (R 4.0.4)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
#> rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.0.4)
#> RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 4.0.0)
#> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3)
#> RCurl 1.98-1.3 2021-03-16 [1] CRAN (R 4.0.5)
#> readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.3)
#> readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.0)
#> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.0.5)
#> reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.0.0)
#> rhdf5 2.36.0 2021-05-19 [1] Bioconductor
#> rhdf5filters 1.4.0 2021-05-19 [1] Bioconductor
#> Rhdf5lib 1.14.0 2021-05-19 [1] Bioconductor
#> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
#> rmarkdown * 2.8 2021-05-07 [1] CRAN (R 4.1.0)
#> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.3)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3)
#> rvest 1.0.0 2021-03-09 [1] CRAN (R 4.0.5)
#> S4Vectors 0.30.0 2021-05-19 [1] Bioconductor
#> sass 0.4.0 2021-05-12 [1] CRAN (R 4.1.0)
#> scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
#> speedyseq * 0.5.3.9018 2021-06-29 [1] Github (mikemc/speedyseq@ceb941f)
#> stringi 1.6.2 2021-05-17 [1] CRAN (R 4.1.0)
#> stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
#> survival 3.2-11 2021-04-26 [2] CRAN (R 4.1.0)
#> svglite 2.0.0 2021-02-20 [1] CRAN (R 4.0.4)
#> systemfonts 1.0.2 2021-05-11 [1] CRAN (R 4.1.0)
#> tibble * 3.1.2 2021-05-16 [1] CRAN (R 4.1.0)
#> tidyr * 1.1.3 2021-03-03 [1] CRAN (R 4.0.4)
#> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
#> tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.1.0)
#> tweenr 1.0.2 2021-03-23 [1] CRAN (R 4.0.5)
#> useful 1.2.6 2018-10-08 [1] CRAN (R 4.0.0)
#> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.5)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
#> vegan 2.5-7 2020-11-28 [1] CRAN (R 4.0.3)
#> vipor 0.4.5 2017-03-22 [1] CRAN (R 4.0.0)
#> viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.0.5)
#> webshot 0.5.2 2019-11-22 [1] CRAN (R 4.0.2)
#> withr 2.4.2 2021-04-18 [1] CRAN (R 4.0.5)
#> xfun 0.23 2021-05-15 [1] CRAN (R 4.1.0)
#> xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.0)
#> XVector 0.32.0 2021-05-19 [1] Bioconductor
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#> zlibbioc 1.38.0 2021-05-19 [1] Bioconductor
#>
#> [1] /home/michael/R/x86_64-pc-linux-gnu-library/4.0
#> [2] /usr/lib/R/library